2 research outputs found

Evaluating Multiway Multilingual NMT in the Turkic Languages

Author: Ataman Duygu
Babu Anoop
Chellappan Sriram
Firat Orhan
Ivanova Sardana
Kreutzer Julia
Licato John
Mirzakhalov Jamshidbek
Moydinboyev Bekhzodbek
Pulatova Shaxnoza
Tyers Francis M.
Uzokova Mokhiyakhon
Wahab Ahsan
Publication venue: The Association for Computational Linguistics
Publication date: 01/11/2021

Despite the increasing number of large and comprehensive machine translation (MT) systems, evaluation of these methods in various languages has been restrained by the lack of high-quality parallel corpora as well as engagement with the people who speak these languages. In this study, we present an evaluation of state-of-the-art approaches to training and evaluating MT systems in 22 languages from the Turkic language family, most of which are extremely under-explored. First, we adopt the TIL Corpus with a few key improvements to the training and the evaluation sets. Then, we train 26 bilingual baselines as well as a multi-way neural MT (MNMT) model using the corpus and perform an extensive analysis using automatic metrics as well as human evaluations. We find that the MNMT model outperforms almost all bilingual baselines on the out-of-domain test sets, and that finetuning the model on a downstream task of a single pair also results in a huge performance boost in both low- and high-resource scenarios. Our attentive analysis of evaluation criteria for MT models in Turkic languages also points to the necessity for further research in this direction. We release the corpus splits, test sets, and models to the public.
Peer reviewed

Helsingin yliopiston digitaalinen arkisto

A Large-Scale Study of Machine Translation in Turkic Languages

Author: Ataman Duygu
Babu Anoop
Chellappan Sriram
Firat Orhan
Hajili Mammad
Ivanova Sardana
Kariev Sherzod
Khaytbaev Abror
Laverghetta Jr. Antonio
Mirzakhalov Jamshidbek
Moydinboyev Bekhzodbek
Onal Esra
Abduraufov Otabek
Pulatova Shaxnoza
Tyers Francis M.
Wahab Ahsan
Publication venue: The Association for Computational Linguistics
Publication date: 01/11/2021

Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted to build competitive systems. However, there is still a large number of languages that are yet to reap the benefits of NMT. In this paper, we provide the first large-scale case study of the practical application of MT in the Turkic language family in order to realize the gains of NMT for Turkic languages under high-resource to extremely low-resource scenarios. In addition to presenting an extensive analysis that identifies the bottlenecks towards building competitive systems to ameliorate data scarcity, our study has several key contributions, including: i) a large parallel corpus covering 22 Turkic languages, consisting of common public datasets in combination with new datasets of approximately 1.4 million parallel sentences; ii) bilingual baselines for 26 language pairs; iii) novel high-quality test sets in three different translation domains; and iv) human evaluation scores. All models, scripts, and data will be released to the public.
Peer reviewed

Helsingin yliopiston digitaalinen arkisto